2022-12-10
In surveying data sets usable for the purposes of this project, our group settled on an extensive database of Spotify songs (tracks) with 232,725 entries across 18 features. The features include a number of acoustic characteristics (loudness, tempo, valence, time signature, etc.), a popularity metric (from 0 to 100), and other identifying features such as genre and the names of the artist and track. All of the members of our team are deeply interested in music and saw this data set, given its expansiveness, as a good platform to build our work upon.
The dataset can be found here: Kaggle - Spotify Tracks DB
With additional credit for the work to grab the data: GitHub - Spotify Data Project
Our group settled on this question because popularity was the metric we saw as most likely to be determined by the other features present in the selected dataset. Stepping back from the specific data at hand, popularity is a metric pertinent both to businesses (be that independent artists or recording labels) and to us as consumers, since popularity undeniably plays a role in what we consume; this relevance to both artist and consumer makes it a worthy subject for our research question.
Per the Spotify Developer API (linked here), a single track has 13 unique features that are quantifiable and contribute to its audio profile. These are detailed in the table below with descriptions and possible value ranges.
There are 5 additional features returned by a GET call to the Spotify Dev API that are useful for labeling and managing data throughout the analytics process but have no predictive value. These are detailed in the second table below. Not all of these miscellaneous features are present in the Kaggle dataset used, but they are included here for complete context.
| # | Characteristic | Description | Value Range |
|---|---|---|---|
| 14 | analysis_url | A URL to access the full audio analysis of this track. An access token is required to access this data. | NA |
| 15 | id | The Spotify ID for the track. | NA |
| 16 | track_href | A link to the Web API endpoint providing full details of the track. | NA |
| 17 | type | The object type. Allowed value: "audio_features" | NA |
| 18 | uri | The Spotify URI for the track. | NA |
Our research turned up one pre-existing project targeting the same question as ours. The work in question was conducted by Matt Devor; operating on the same 13 characteristics we identified from the API, Devor performed a range of data analysis and built toward a linear regression model to directly predict popularity. Devor's work is limited by the nature of its concluding model, but it serves as an extensive reference for exploratory data analysis that our team looks to build upon in our work.
Reference: GitHub - Predicting Spotify Song Popularity
To address the question arrived at above (exploring the predictive power of the acoustic characteristics of an audio track in determining that track's popularity) in a manner distinct from existing work, our team will use a random forest model; the model will perform regression over the continuous target variable popularity. Through exploratory analytics and further method decisions we seek to create a model with a sub-5% NRMSE for potential use in a real-world business context.
## [1] "Initial mtry value: 3.60555127546399"
## [1] "RMSE of our first RF model is: 15.4594981812402"
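As a minimal base-R sketch of the quantities in the log above: the starting mtry is the square root of the 13 audio features, and the error metrics can be written as small helpers. Note the NRMSE definition here (RMSE normalized by the observed range of the target) is our reading of the output, not something stated explicitly in the source; the toy vectors are illustrative only.

```r
# Starting mtry: square root of the number of predictors (13 audio features)
p <- 13
mtry_init <- sqrt(p)
print(paste("Initial mtry value:", mtry_init))  # matches the logged 3.60555127546399

# RMSE, and NRMSE normalized by the observed range of the target (an assumption)
rmse  <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
nrmse <- function(actual, predicted) rmse(actual, predicted) / (max(actual) - min(actual))

# Illustrative usage with toy popularity scores
actual    <- c(40, 55, 60, 72)
predicted <- c(42, 50, 65, 70)
rmse(actual, predicted)
nrmse(actual, predicted)
```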
| Feature | Overall |
|---|---|
| acousticness | 3841.0445 |
| danceability | 2250.1343 |
| duration_ms | 2993.5943 |
| energy | 2658.1052 |
| instrumentalness | 1599.8494 |
| key | 3672.9501 |
| liveness | 2068.6330 |
| loudness | 3687.1308 |
| mode | 200.6078 |
| speechiness | 2003.7502 |
| tempo | 1661.8719 |
| time_signature | 377.4922 |
| valence | 2046.0900 |
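The "Overall" column above is the random forest's variable importance for each feature (which importance measure, e.g. node purity, is not stated in the source). Re-entering the table's values as a named vector lets us rank the features; this is purely a restatement of the table, not new model output.

```r
# Variable importance values copied from the table above
imp <- c(acousticness = 3841.0445, danceability = 2250.1343, duration_ms = 2993.5943,
         energy = 2658.1052, instrumentalness = 1599.8494, key = 3672.9501,
         liveness = 2068.6330, loudness = 3687.1308, mode = 200.6078,
         speechiness = 2003.7502, tempo = 1661.8719, time_signature = 377.4922,
         valence = 2046.0900)

# Rank features from most to least important:
# acousticness, loudness, and key lead; mode and time_signature trail
sort(imp, decreasing = TRUE)
```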
To optimize the model we tuned the mtry hyperparameter. When run on our tuning dataset, an mtry of 2 proved more accurate than the value of 4 used when we originally created the model. In response, we created another model with mtry set to 2, which gave better results (both RMSE and NRMSE) when run on our test dataset.
## mtry = 4 OOB error = 231.1599
## Searching left ...
## mtry = 2 OOB error = 229.1461
## 0.008711894 0.05
## Searching right ...
## mtry = 12 OOB error = 232.2126
## -0.004553863 0.05
## [1] "RMSE of original model: 15.4594981812402"
## [1] "RMSE of optimized model: 14.5223043796896"
## [1] "pop_RF_2 rmse: 14.1544440007784"
## [1] "pop_RF_2 test NRMSE: 0.164586558148585"
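The tuning log above has the shape of `randomForest::tuneRF` output, which steps mtry up and down by a factor and keeps searching only while the relative OOB-error improvement exceeds the `improve` threshold (0.05 here). Replaying the logged OOB errors in base R shows why the search settled on mtry = 2; the numbers are taken directly from the log, and the `tuneRF` attribution is our reading of the output, not confirmed by the source.

```r
# OOB errors from the tuning log, keyed by mtry
oob <- c(`4` = 231.1599, `2` = 229.1461, `12` = 232.2126)

# Relative improvement over the mtry = 4 baseline
rel_improve <- function(new, base) 1 - new / base
left  <- rel_improve(oob["2"],  oob["4"])   # ~0.0087, below the 0.05 threshold
right <- rel_improve(oob["12"], oob["4"])   # negative: mtry = 12 is worse

# Neither direction clears improve = 0.05, so the search stops;
# the lowest OOB error still belongs to mtry = 2
best_mtry <- as.numeric(names(which.min(oob)))
```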
We followed the same steps as above when working with 3 subsets of our original dataset: rap music, jazz music, and country music.
## [1] "Starting mtry for rap dataset: 3.60555127546399 rounds to 4"
## [1] "RMSE for first Rap RF model: 8.07926310206631"
## mtry = 4 OOB error = 67.62196
## Searching left ...
## mtry = 2 OOB error = 67.34297
## 0.004125716 0.05
## Searching right ...
## mtry = 12 OOB error = 68.69809
## -0.01591392 0.05
## [1] "rap_RF_2 RMSE: 8.11214128973794"
## [1] "rap_RF_2 NRMSE: 0.124802173688276"
##
## Call:
## randomForest(formula = as.numeric(popularity) ~ ., data = rap_train, ntree = 500, mtry = 5, replace = TRUE, sampsize = 400, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 65.80684
## % Var explained: 0.91
## [1] "rap_RF_2 test RMSE: 8.17224959688551"
## [1] "rap_RF_2 test NRMSE: 0.0961441129045354"
## [1] "RMSE of first Jazz Model: 9.10492653612027"
## mtry = 4 OOB error = 86.96277
## Searching left ...
## mtry = 2 OOB error = 83.29836
## 0.04213769 0.05
## Searching right ...
## mtry = 8 OOB error = 85.91315
## 0.01206981 0.05
## [1] "jazz_RF_2 RMSE: 9.11880324260826"
## [1] "jazz_RF_2 NRMSE: 0.124915112912442"
## [1] "jazz_RF_2 test RMSE: 9.51695550165936"
## [1] "jazz_RF_2 test NRMSE: 0.120467791160245"
## [1] "RMSE of first Country Model: 9.50881642786787"
## mtry = 4 OOB error = 94.18628
## Searching left ...
## mtry = 2 OOB error = 91.8066
## 0.02526563 0.05
## Searching right ...
## mtry = 8 OOB error = 95.83211
## -0.01747426 0.05
## [1] "country_RF_2 RMSE: 9.50399215928269"
## [1] "country_RF_2 NRMSE: 0.120303698218768"
## [1] "country_RF_2 test RMSE: 9.66801352692891"
## [1] "country_RF_2 test NRMSE: 0.117902603986938"
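Pulling together the test-set NRMSE values reported in the logs above (the numbers below are re-entered from those logs, not recomputed), each genre-specific model outperforms the all-genre model, with the rap model lowest:

```r
# Test-set NRMSE values reported in the logs above
nrmse_test <- c(all_genres = 0.164586558148585,
                rap        = 0.0961441129045354,
                jazz       = 0.120467791160245,
                country    = 0.117902603986938)

# Order models from best (lowest NRMSE) to worst
sort(nrmse_test)
```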